ConDoR: tumor phylogeny inference with a copy-number constrained mutation loss model

A tumor contains a diverse collection of somatic mutations that reflect its past evolutionary history and that range in scale from single nucleotide variants (SNVs) to large-scale copy-number aberrations (CNAs). However, no current single-cell DNA sequencing (scDNA-seq) technology produces accurate measurements of both SNVs and CNAs, complicating the inference of tumor phylogenies. We introduce a new evolutionary model, the constrained k-Dollo model, that uses SNVs as phylogenetic markers but constrains losses of SNVs according to clusters of cells. We derive an algorithm, ConDoR, that infers phylogenies from targeted scDNA-seq data using this model. We demonstrate the advantages of ConDoR on simulated and real scDNA-seq data. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-023-03106-5.


A Proofs
A.1 Proof to Theorem 2 in the Main text Before we start with the proof, we state a few definitions and results that would be useful in the proof.
Definition 2 (Binary factorization).Let B be a n × m multi-state character matrix for m characters with r + 1 states {0, 1, . . ., r} and state tree S with 0 as the root state.The binary factorization B ′ = [b ′ i,(j,s) ] of B is a n × mr binary matrix in which b ′ i,(j,s) = 1 for j ∈ [m], s ∈ [r] if b i,j = s ′ where s ⪯ S s ′ , i.e. s ′ is in the subtree of S rooted at s.
The following theorem from [1] relates the k-Dollo state tree with k-Dollo completion matrices.
Theorem 1 (El-Kebir 2018 [1]).Let B ∈ {0, . . ., k + 1} n×m be a k-Dollo completion matrix.Then there exists a multi-state perfect phylogeny T B for B such that each mutation j ∈ [m] has the k-Dollo state tree The following theorem describes a characterization of matrices that admits a binary perfect phylogeny.

Theorem 2 ( [2]
). Binary matrix A admits a perfect phylogeny if and only if no pair of columns (j, j ′ ) that contain the three gametes (0, 1), (1,0) and (1,1).Now, we provide the proof for Theorem 2 in the main text.We restate the theorem here for completeness and provide the proof in the following.Theorem 3.There exists a constrained k-Dollo phylogeny T with at most k SNV losses for A ∈ {0, 1} n×m and copy-number clustering σ : [n] → [p] if and only if there exists a k-Dollo completion B ∈ {0, . . ., k + 1} n×m of A that is consistent with σ.
Proof.(⇒) Let T be a constrained k-Dollo phylogeny T for binary matrix A and copy-number clustering σ with at most k losses.We construct a k-Dollo completion matrix B of A and show that it is consistent with σ.
We construct B as follows.For SNV j ∈ [m], let edges of T that induce a loss in SNV j be given by the set {e j,2 , . . ., e j,q }, where q ≤ k + 1 and the edges are indexed in a breadth first manner.For each leaf v ∈ L(T ) corresponding to row i in matrix A, we set b i,j = s if the edge e j,s is in the unique path from the root r(T ) to v.For each (i, j) ∈ [n] × [m], if a i,j = 1, we set b i,j = 1.For the rest of the entries we set b i,j = 0.It is easy to see that B is a k-completion of A.
Let us assume that B is not consistent with copy-number clustering σ.Then for some j ∈ [m], either condition (1) or condition (2) of Definition 4 in the main text is violated.
Case 1: Lets say condition (1) is violated.Then, there must exist two clusters ℓ and ℓ ′ such that b i,j = 0 and b i ′ ,j = 1 for distinct cells i, i ′ of these clusters.Since the set of nodes labeled by the clusters ℓ and ℓ ′ are disjoint, the mutation j must be gained at least twice in T .However, this contradicts the premise that T is a constrained Dollo phylogeny.
Case 2: Lets say condition (2) is violated.Then, there must exist two rows i and i ′ such that b i,j = s for s ∈ {2, . . ., k + 1}, b i ′ ,j ̸ = s and σ(i) = σ(i ′ ) = ℓ.Let v and v ′ be the leaves corresponding to the rows i and i ′ of B. Let r(ℓ) be the root of the subtree induced by the nodes labeled by ℓ in T .Consider the path from the root r(T ) to v i .The node r(ℓ) be lie of this path.As such, we can denote this path by P = {r(T ), . . ., r(ℓ), . . ., v i }.Since b i,j = s and s ∈ {2, . . ., k + 1}, by construction, there must exist an edge e ∈ P that induces a loss in SNV j.Moreover, since losses can only occur between nodes of T with distinct cluster labeling, edge e most lie in the path {r(T ), . . ., r(ℓ)}.Let b i ′ ,j = t ̸ = s.We have three cases, (i) t = 0, (ii) t = 1 and (iii) t ∈ {2, . . ., k + 1} \ {s}.

Case (i):
Since σ(v) = σ(v ′ ) = ℓ and T is a constrained Dollo phylogeny, r(ℓ) must lie in the unique path from r(T ) to v ′ .As such, the edge e which induces a loss of SNV j must lie on the path from r(T ) to v ′ .However since b i ′ ,j = 0, by construction, the SNV j is neither lost or gained along the path from r(T ) to v ′ , which is a contradiction.
Case (ii): Since b i ′ ,j = 1, by construction, we have a i ′ ,j = 1.Considering the path from r(T ) to v ′ , we have an edge e at which SNV j is lost in the subpath from r(T ) to r(ℓ).The SNV must be gained in an edge preceding e in the path P .Since a i ′ ,j = 1, the SNV must also be gained in an edge succeeding e in the path P .However, this contradicts the premise that T is a constrained Dollo phylogeny in which each SNV can be gained at most once.
Case (iii): Since b i ′ ,j = t and t ∈ {2, . . ., k + 1}, by construction, there must exist an edge e ′ in the path P ′ = {r(T ), . . ., r(ℓ), . . ., v ′ } that induces a loss in SNV j.Moreover, since losses can only occur between nodes of T with distinct cluster labeling, edge e ′ must lie in the path {r(T ), . . ., r(ℓ)}.By construction e and e ′ are distinct since s ̸ = t.Without loss of generality, let s < t.As such, the SNV must be gained once along the path r(T ) to the source node of e and once along the path from the target node of e to the source node of e ′ .However, this contradicts the premise that T is a constrained Dollo phylogeny in which each SNV can be gained at most once.
(⇐) Let B be a k-Dollo completion matrix of A that is consistent with copy-number clustering σ.For cluster ℓ ∈ [p], let σ −1 (ℓ) be the set of cells in that cluster, i.e.
We show that there exists a constrained k-Dollo phylogeny T for A and copy-number clustering σ.Equivalently, we show that there exists a state tree on the copy-number clusters such that the k-Dollo completion matrix B along with the character for the copy-number cluster admits a multi-state perfect phylogeny.Recall that k-Dollo completion matrix B admits a k-Dollo phylogeny and as such, the binary factorization B ′ of B with respect to the k-Dollo state tree admits a two-state perfect phylogeny.
First, we define a matrix R ∈ {0, 1, . . ., k + 1} p×m which represents the mutational state of the roots of the subtrees formed by each copy number cluster.We build R from the k-Dollo completion matrix B as follows, We show that the (n+p)×m augmented matrix B = [B; R] admits a multi-state perfect phylogeny with k-Dollo state tree S[k].This is equivalent to showing that the binary factorization a two-state perfect phylogeny.We show this by contradiction.If B ′ does not admit a perfect phylogeny, let (j, s) and (j ′ , s ′ ) be two columns of B ′ that have the three gametes (0, 1), (1, 0) and (1, 1).Note that B ′ admits a perfect phylogeny, since B is a k-Dollo completion matrix.As such, at least one of the rows for the three gametes must correspond to (r ′ ℓ,(j,s) , r ′ ℓ,(j ′ ,s ′ ) ) for some cluster ℓ.However, by construction of R, there must be at least one cell in cluster ℓ that has the same gamete.This would violate the 3-gamete condition for B ′ , which is a contradiction.Now we show that R ′ form roots of subtrees in the two-state perfect phylogeny for B ′ .Note that when a perfect phylogeny exists, it is unique [3].Moreover, B ′ admits a two-state perfect phylogeny since B is a k-Dollo completion matrix.As such, we only need to show that R ′ are roots of subtrees in the two-state perfect phylogeny for B ′ .Consider a cluster ℓ.By construction of R, it is clear that for any mutation j, the state r ℓ,j is the lowest common ancestor of the states of the cells i ∈ σ(ℓ) with respect to the state tree We only need to make sure that the subtrees formed by the 1-sets are either related by containment or disjoint of the subtrees defined by the clusters.By contradiction, suppose there is a mutation j for which the 1-set intersects with the two clusters ℓ and ℓ ′ such that r ℓ,j = 0 and r ℓ ′ ,j = 0.However, this would mean that there are two clusters ℓ and ℓ ′ where the mutation j is gained, which contradicts the premise that matrix B is consistent with the copy-number clustering σ.

A.2 Proof to Theorem 3 in the Main text
We restate the theorem here for completeness and provide a proof in the following.
Proof.We show this by reduction from the Flip problem [4] which is known to be NP-complete and is stated as follows: Problem 1 (Flip problem).Given a binary matrix D and an integer c ∈ N, does there exist a binary matrix B such that (i) B differs from D by at most c entries and (ii) B admits a perfect phylogeny.
Given a instance (D, c) of the Flip problem we construct an instance of the CkDP-RC problem with k = 0 as follows.We put all the cells i ∈ [n] in the same copy-number cluster, i.e.
The variant read count matrix Q and total read count matrix R as built as follows.For each entry with d i,j = 0, we set q i,j = q 0 and r i,j = r 0 , where q 0 and r 0 are chosen such that Pr(q 0 | r 0 , a i,j = 0) = µ > 0.5.For each entry (i, j) ∈ [n] × [m] with d i,j = 1, we set q i,j = q 0 and r i,j = r 0 , where q 0 and r 0 are chosen such that Pr(q 0 | r 0 , a i,j = 1) = µ > 0.5.Clearly this construction can be completed in polynomial time.
The likelihood of A is given by where 1(•) is the indicator function.Let c ′ be the number of entries where D and A differ.Then we have We show that the Flip problem instance (D, c) has a solution if and only if the corresponding CkDP-RC has a solution with likelihood (⇒) Let B be the solution to the Flip problem instance (D, c).From Eq. 13, the likelihood for mutation matrix A = B is given by µ nm−c (1 − µ) c .Moreover, B is a perfect phylogeny and therefore it is also a constrained 0-Dollo phylogeny with the constructed copy-number clustering σ.
(⇐) Let A be the solution to the CkDP-RC problem such that Then, suppose A differs from D in c ′ > c entries.Then from Eq. 13 we have Theorem 3.2 in [6] states the following.
Lemma 1. Matrix B ∈ {0, . . ., k + 1} is a k-Dollo completion matrix of some mutation matrix A if and only if the corresponding extended binary matrix B ′ admits a perfect phylogeny.
We now encode the extended binary matrix B ′ obtained from the k-completion matrix B of the mutation matrix A in our MILP formulation.Note that b ′ i,j + = 1 if and only if b i,j ≥ 1.As such, the entries b ′ i,j + are modeled by the variable x i,j .Further, b ′ i,j − s = 1 if and only if b i,j = s for s ∈ {2, . . ., k + 1}.As such, the entries b ′ i,j − s are modeled by c σ(i),j,s .We use the set inclusion and disjointness (SID) formulation described in [7] to enforce that the extended binary matrix B ′ is a perfect phylogeny.The SID formulation is based on the following theorem from [2].Here we provide the precise commands used to run competing methods for benchmarking on the simulated instances.Methods such as SPhyR, SCITE and SiFit take error rates, i.e. false positive rate and false negative rate, as input during inference.For simulated data, we empirically estimate the false positive rate and false negative rates (Figure S1) in the observed mutation matrix A ′ obtained in the simulations.We provide the aforementioned methods with the median false positive rate α = 0.0018 and median false negative rate β = 0.158 as input.Description of keywords used in the command-line arguments are provided in Table S1.Note that ConDoR is provided the accurate copy number clusters in the simulations as input, and SCARLET is provided the accurate copy number clusters and the ground-truth copy number tree in the simulations as input.

C Method parameters
ConDoR We run ConDoR with both false positive and false negative sequencing error rates set to 0.001.SPhyR We run SPhyR with the flag -lT (which sets the number of distinct rows in the inferred mutation matrix) set to m + p since that is the maximum number of events that can occur in the phylogeny.

D Additional simulations
We evaluate the performance and robustness of ConDoR towards (i) misspecification of the read count model and (ii) noise in the copy number clustering on simulations from a previous publication [8] that are publicly available at https://github.com/raphael-group/scarlet/tree/master/data/simulations.We describe the procedure employed by Satas et al. [8] to generate these simulations for completeness.Satas et al. were joined to the trunk in such a way that vertices in the same copy number clone formed connected subtrees.Mutations were assigned to the 20 edges that did not correspond to change in copy number state and losses were assigned to an edge with probability 0.5 if the loss is supported and present in the parent vertex.
Finally, the n = 100 cells were randomly assigned to the vertices of the simulated tree.
Satas et al. [8] simulate the read count data as follows.The total read counts are drawn from a Poisson distribution with mean of 100 to simulate 100× coverage.When the mutation is present in a cell, the variant read counts are drawn from a Binomial distribution with rate of success given by max{Beta(0.25, 0.25), 0.001} and when the mutation is absent in a cell, the variant read counts are drawn from a Binomial distribution with rate of success of 0.001.

D.1 Different read count model
To investigate the effect of the read count model on the performance ConDoR, we use two sets of simulations, (i) where the read counts are simulated using the negative binomial model described in the 'Simulation details' Section in the main text and (ii) where the read counts are simulated using the binomial model used by Satas et al. [8].

Case (i)
The results on the first set of simulations show that ConDoR and SCARLET are the top performing methods (Figure S2).We see that ConDoR has very similar performance compared to SCARLET (median 0.007 and 0.0065, respectively) in terms of the normalized mutation matrix error while the pairwise ancestral relation accuracy of ConDoR is higher than SCARLET (median 0.87 vs 0.84).Case (ii) Figure S3 shows the results on the simulations from Satas et al. [8].Under this setting, ConDoR outperforms all the methods except SCARLET.While the pairwise ancestral relation accuracy of SCARLET and ConDoR is very similar (median 0.8578 and 0.8578, respectively), we find that SCARLET outperforms ConDoR (median 0.0145 vs 0.0185) in terms of the mutation matrix error.However, this is expected because SCARLET is given the copy number clustering and the ground-truth copy number tree as input while ConDoR is only given the copy number clustering as input.Moreover, we use the same read count model to generate these simulations as the one used by SCARLET during inference.

D.2 Noise in copy number clustering
We evaluate the robustness of ConDoR by adding noise in the copy number clustering of the simulations described above that used the negative binomial read count model (Case (i) described in Section D.1).Let h ∈ {0.05, 0.1, 0.15, 0.2} be the fraction of cells for which we provide incorrect cluster information in the simulations.We uniformly at random pick h fraction of cells to assign incorrect copy number states.For each of the picked cells we assign one of the p − 1 incorrect copy number states that is drawn uniformly at random.We simulate 5 instances for each value of h. Figure S4 shows that ConDoR is robust to errors in the copy number clusters.In fact, even when 20% of the cells have incorrect copy number state (h = 0.2), ConDoR has lower normalized mutation matrix error (meidan 0.044) and higher pairwise ancestral relation accuracy (median 0.774) compared to SPhyR (0.046 and 0.737), SiFit (0.051; pairwise ancestral relation accuracy is not applicable for SiFit) and SCITE (0.044 and 0.705).

E Supplementary results
We have the following supplementary figures and tables.
• Figure S5 shows the runtime of ConDoR, SCARLET, SPhyR, SiFit and SCITE on the simulation instances ("Evaluation on simulated data" section in the Main text).
• Figure S6 shows the distribution of length and coverage of the 596 amplicons used in the PDAC study ("Multi-region Pancreatic ductal adenocarcinoma data" section in the Main text).
• Figure S7 shows the phylogeny constructed using COMPASS and SPhyR on the PDAC dataset ("Multi-region Pancreatic ductal adenocarcinoma data" section in the Main text).
• Figure S8 shows the phylogeny constructed using SCITE and SCARLET on metastatic CRC data ("Metastatic colorectal cancer data" section in the Main text).

5 .Definition 3 .
But this contradicts Eq. 14. Hence A must differ from D in at most c entries and thus is a solution to the Flip problem instance (D, c).B MILP formulation B.1 k-Dollo completion constraints Definition 3 in the main text provides O(k 4 ) distinct 3 × 2 submatrices that must be avoided in a k-Dollo completion matrix B. Incorporating this in the MILP will require O(n 3 m 2 k 4 ) constraints which is infeasible for large values of n, m and k.We instead employ an alternate characterization of k-Dollo completion matrices derived from the work of [5].In this characterization, for a given matrix B ∈ {0, . . ., k + 1} n×m , we define the n × m(k + 1) extended binary matrix B ′ such that B is a k-Dollo completion matrix for some mutation matrix A if and only if B ′ admits a perfect phylogeny.The extended binary matrix B ′ ∈ 0, 1 n×m(k+1) is obtained from the k-completion matrix B as follows: for each entry b i,j we have the entry b ′ i,j + and k entries b ′ i,j − s (for s ∈ {2, . . ., k + 1}) such that

Figure S1 :
Figure S1: Histogram of the (a) false positive rate α and the (b) false negative rate β in the observed mutation matrix A ′ of simulated instances.Red lines show the median false positive rate (0.0018) and median false negative rate (0.158).

[ 8 ]
simulated 50 datasets, each containing n = 100 cells, m = 20 mutations and p = 4 copy number clones.The set of losses supported between each pair of copy number clones were randomly drawn from a Poisson distribution with rate 0.2 × m.To simulate the topology of the phylogeny, m + p + 1 = 25 vertices were randomly assigned to the trunk or one of the p copy number clones.The vertices that were assigned to the trunk where joined in a linear path and the other vertices

Figure S2 :Figure S3 :Figure S4 :
Figure S2: (a) Normalized mutation matrix error and (b) pairwise ancestral relation accuracy for each method on simulations from Satas et al.[8], where the read counts were simulated using the negative binomial model (related to the Section 'Evaluation on simulated data' in the main text).

Figure S5 :
Figure S5: (a) Normalized Robinson-Foulds distance of the phylogenies inferred by all the methods from the groundtruth phylogeny.(b)Computational time taken by all the methods on simulated instances ("Evaluation on simulated data" section in the Main text).

Figure S6 :
Figure S6: Histogram of (a) length and (b) normalized read depth for the 596 amplicons used for targeted single-cell sequencing in the PDAC study ("Multi-region Pancreatic ductal adenocarcinoma data" section in the Main text).The complete panel covers 121743 bps across the genome.

Figure S9 :
Figure S9: (a) Silhouette score for increasing number of clusters in k-means clustering of cells using the gene-level average normalized read depth RG for the PDAC dataset.(b) tSNE plot showing the k-means clustering (that has 3 clusters) with the highest Silhouette score (0.96).Related to "Multi-region Pancreatic ductal adenocarcinoma data" section in the Main text.

Table S1 :
Description of the keywords used in the command-line arguments.